\section{Selection}

\subsection{MVA}
\label{sec:selection:mva}

In order to further reduce the background contribution, a multivariate analysis is performed. The combinatorial background sample used in the training is taken from the upper sideband above $5600\mev$ of the vertex-constrained $\Bu$ invariant mass, whereas the signal sample is defined as the reweighted MC described above.

\subsubsection{Algorithm optimisation}

Several classifiers are examined. They are trained and tested on the samples using a cross-validation technique, as only a limited number of events is available. A stratified k-folding strategy is applied, which works as follows (a schematic code sketch is given after Table~\ref{tab:xgbconfig}):
\begin{enumerate}
\item The data (both signal and background) is split into $k$ sub-samples, each containing the same proportion of each class\footnote{A class in this context refers to the ``label'' or the ``y''; here there are two classes, signal and background.}.
\item A training set consisting of $k-1$ sub-samples and a testing set consisting of the remaining sub-sample are created.
\item The algorithm is trained on the training set and tested on the testing set.
\item The predictions made by the algorithm on the testing set are collected.
\item This procedure is repeated $k$ times, each time with a different sub-sample as testing set, so that in the end a prediction is obtained for every event.
\end{enumerate}
For the evaluation and comparison of the performance, the area under the ROC curve (ROC AUC) is used, with the goal of maximising it. Before the classifiers are compared against each other, a hyper-parameter optimisation is performed for each classifier\footnote{For the DNN, several well-performing architectures from other analyses were used as inspiration for the set-up, then tested and varied.}.

The best-performing algorithm for our case is a boosted decision tree (BDT) implementation, the extreme gradient boosting (XGB) algorithm with decision trees (DT) as base classifiers~\cite{Breiman,AdaBoost,ML:XGBoost}. Similar performance is obtained by other algorithms such as random forests and deep neural networks (DNN). A random forest averages the predictions of several DTs but, for our application, requires considerably more memory while not outperforming the XGB. A DNN, on the one hand, is intrinsically hard and time-consuming to train and to optimise in its architecture. On the other hand, DNNs are in general able to outperform most other classifiers, but usually only when low-level features and large amounts of data are available, so that additional correlations can be exploited instead of merely picking up noise. As only high-level features and a limited amount of data are available here, it was expected that the DNN would not show superior performance. The final configuration used for the XGB in the selection is given in Table~\ref{tab:xgbconfig}.

\begin{table}[tb]
\begin{center}\begin{tabularx}{\textwidth}{lcX}
\hline
Parameter & Value & Explanation\\
\hline
n\_estimators & 500 & Number of base classifiers (DT); equal to the number of boosting rounds performed.\\
\hline
eta & 0.1 & Factor by which the weights of each boosting stage are multiplied. There is a trade-off between eta and n\_estimators, and their combination largely determines how complex the model is. In other boosting algorithms this is usually called the ``learning rate''. \\
\hline
max\_depth & 6 & Maximum depth of the DT. Higher values create more complex models which are able to capture higher-order correlations.\\
\hline
gamma & 0 & Minimum gain required to perform a split. Larger values create more conservative models. \\
\hline
subsample & 0.8 & Fraction of the data used to train each DT. Reduces over-fitting. \\
\hline
\end{tabularx}\end{center}
\caption{Hyper-parameter configuration of the XGB classifier used for the selection.}
\label{tab:xgbconfig}
\end{table}
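The stratified k-folding procedure and the configuration of Table~\ref{tab:xgbconfig} can be summarised in the following Python sketch. It is illustrative only: the feature matrix \texttt{X}, the label vector \texttt{y} (signal MC versus upper-sideband data), the number of folds and the random seed are assumptions and are not fixed by this analysis.

\begin{verbatim}
# Minimal sketch of the stratified k-fold training of the XGB classifier.
# X (features) and y (1 = signal MC, 0 = upper-sideband data) are assumed
# to have been prepared beforehand; k and random_state are illustrative.
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score
from xgboost import XGBClassifier

def train_kfold(X, y, k=10):
    """Return out-of-fold predictions for every event."""
    predictions = np.zeros(len(y))
    folds = StratifiedKFold(n_splits=k, shuffle=True, random_state=42)
    for train_idx, test_idx in folds.split(X, y):
        clf = XGBClassifier(
            n_estimators=500,   # number of boosting rounds / base DTs
            learning_rate=0.1,  # called "eta" in the native XGB interface
            max_depth=6,
            gamma=0,
            subsample=0.8,
        )
        clf.fit(X[train_idx], y[train_idx])
        # collect the predictions of the fold left out of the training
        predictions[test_idx] = clf.predict_proba(X[test_idx])[:, 1]
    print("ROC AUC:", roc_auc_score(y, predictions))
    return predictions
\end{verbatim}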
\subsubsection{Feature selection}

To achieve good discrimination power, it is crucial to use appropriate input variables. Badly simulated features are excluded in order to avoid training MC against real data instead of signal against background. In addition, any direct correlation of the features with the \Bu mass has to be avoided in order to allow an unbiased yield estimation later on. The variables used in the training are listed in Table~\ref{tab:xgbvariables} and shown in Figure~\ref{fig:selection:features}; a sketch of the corresponding transformations is given after the figures.

\begin{table}[tb]
\begin{center}\begin{tabularx}{\textwidth}{llX}
\hline
Particle & Variable & Explanation \\
\hline
\Bu, \jpsi, $\kaon_1$ & $\log(\pt)$ & \ptexpl \\
& $\log(\chisqvtx)$ & \chisqvtxexpl \\
& $\log(\chisqip)$ & \chisqipexpl \\
& $\log(\chisq_{\mathrm{FD}})$ & \chisqfdexpl \\
\hline
\Bu & $\log(1-\mathrm{DIRA})$ & \diraexpl \\
& $\log(\mathrm{AMAXDOCA})$ & \amaxdocaexpl \\
& $\log(\chisq_{\mathrm{VTX\,iso}}\ \mathrm{one\ track})$ & Measure for the isolation of the reconstructed track, obtained by removing the track under consideration and repeating the reconstruction.\\
& $\log(\chisq_{\mathrm{VTX\,iso}}\ \mathrm{two\ track})$ & \\
\hline
$\kaon_1$, $\jpsi$ & $\log(1 - \cos(\theta))$ & \thetaexpl \\
\hline
\pip, \pim & $\log(\pt)$ & \ptexpl \\
\hline
\end{tabularx}\end{center}
\caption{Variables used in the training of the XGB classifier.}
\label{tab:xgbvariables}
\end{table}

\begin{figure}[tb]
\centering
\begin{subfigure}{0.45\linewidth}
\includegraphics[width=\linewidth]{figs/selection_features/log(1_-_b_dira_ownpv).pdf}
\end{subfigure}
\begin{subfigure}{0.45\linewidth}
\includegraphics[width=\linewidth]{figs/selection_features/log(1_-_jpsi_costheta).pdf}
\end{subfigure}
%
\begin{subfigure}{0.45\linewidth}
\includegraphics[width=\linewidth]{figs/selection_features/log(1_-_k1_1270_costheta).pdf}
\end{subfigure}
\begin{subfigure}{0.45\linewidth}
\includegraphics[width=\linewidth]{figs/selection_features/log(b_amaxdoca).pdf}
\end{subfigure}
%
\begin{subfigure}{0.45\linewidth}
\includegraphics[width=\linewidth]{figs/selection_features/log(b_endvertex_chi2).pdf}
\end{subfigure}
\begin{subfigure}{0.45\linewidth}
\includegraphics[width=\linewidth]{figs/selection_features/log(b_fdchi2_ownpv).pdf}
\end{subfigure}
%
\begin{subfigure}{0.45\linewidth}
\includegraphics[width=\linewidth]{figs/selection_features/log(b_ipchi2_ownpv).pdf}
\end{subfigure}
\begin{subfigure}{0.45\linewidth}
\includegraphics[width=\linewidth]{figs/selection_features/log(b_pt).pdf}
\end{subfigure}
%
\begin{subfigure}{0.45\linewidth}
\includegraphics[width=\linewidth]{figs/selection_features/log(b_vtxisodchi2onetrack).pdf}
\end{subfigure}
\begin{subfigure}{0.45\linewidth}
\includegraphics[width=\linewidth]{figs/selection_features/log(b_vtxisodchi2twotrack).pdf}
\end{subfigure}
% 10 plots above
\caption[selection features]{Features used in the training of the XGB classifier.}
\label{fig:selection:features}
\end{figure}

\begin{figure}\ContinuedFloat
\centering
\begin{subfigure}{0.45\linewidth}
\includegraphics[width=\linewidth]{figs/selection_features/log(jpsi_endvertex_chi2).pdf}
\end{subfigure}
\begin{subfigure}{0.45\linewidth}
\includegraphics[width=\linewidth]{figs/selection_features/log(jpsi_fdchi2_ownpv).pdf}
\end{subfigure}
%
\begin{subfigure}{0.45\linewidth}
\includegraphics[width=\linewidth]{figs/selection_features/log(jpsi_ipchi2_ownpv).pdf}
\end{subfigure}
\begin{subfigure}{0.45\linewidth}
\includegraphics[width=\linewidth]{figs/selection_features/log(jpsi_pt).pdf}
\end{subfigure}
%
\begin{subfigure}{0.45\linewidth}
\includegraphics[width=\linewidth]{figs/selection_features/log(k1_1270_endvertex_chi2).pdf}
\end{subfigure}
\begin{subfigure}{0.45\linewidth}
\includegraphics[width=\linewidth]{figs/selection_features/log(k1_1270_fdchi2_ownpv).pdf}
\end{subfigure}
%
\begin{subfigure}{0.45\linewidth}
\includegraphics[width=\linewidth]{figs/selection_features/log(k1_1270_ipchi2_ownpv).pdf}
\end{subfigure}
\begin{subfigure}{0.45\linewidth}
\includegraphics[width=\linewidth]{figs/selection_features/log(k1_1270_pt).pdf}
\end{subfigure}
%
\begin{subfigure}{0.45\linewidth}
\includegraphics[width=\linewidth]{figs/selection_features/log(piminus_pt).pdf}
\end{subfigure}
\begin{subfigure}{0.45\linewidth}
\includegraphics[width=\linewidth]{figs/selection_features/log(piplus_pt).pdf}
\end{subfigure}
\caption{Features used in the training of the XGB classifier (continued).}
\end{figure}
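As an illustration of the transformations listed in Table~\ref{tab:xgbvariables}, the following Python sketch builds the training features from the raw quantities. It assumes a flat \texttt{pandas} DataFrame \texttt{df} whose branch names follow the labels of Figure~\ref{fig:selection:features} (e.g.\ \texttt{b\_pt}, \texttt{b\_dira\_ownpv}); this format is an assumption and not necessarily the one used in the analysis.

\begin{verbatim}
# Minimal sketch of the feature preparation; the branch names follow the
# figure labels and are assumptions about the underlying ntuple format.
import numpy as np
import pandas as pd

def build_features(df):
    out = pd.DataFrame(index=df.index)
    # pT and vertex / IP / flight-distance chi2: take the logarithm to
    # compress the long tails of the distributions
    for p in ["b", "jpsi", "k1_1270"]:
        for var in ["pt", "endvertex_chi2", "ipchi2_ownpv", "fdchi2_ownpv"]:
            out[f"log({p}_{var})"] = np.log(df[f"{p}_{var}"])
    # quantities close to one: log(1 - x) spreads out the peak at 1
    out["log(1 - b_dira_ownpv)"] = np.log(1.0 - df["b_dira_ownpv"])
    for p in ["jpsi", "k1_1270"]:
        out[f"log(1 - {p}_costheta)"] = np.log(1.0 - df[f"{p}_costheta"])
    # remaining B variables and pion pT
    for var in ["amaxdoca", "vtxisodchi2onetrack", "vtxisodchi2twotrack"]:
        out[f"log(b_{var})"] = np.log(df[f"b_{var}"])
    for p in ["piplus", "piminus"]:
        out[f"log({p}_pt)"] = np.log(df[f"{p}_pt"])
    return out
\end{verbatim}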
\subsubsection{Prediction and performance}

In order to obtain reliable predictions from the classifier, the k-folding strategy is also used on the data sample. To this end, the data is split into $k$ folds and, for each fold, the right sideband of the remaining folds is trained against the MC signal. The algorithm then makes predictions on the full data of the fold not used in the training, not only on its right sideband.

To apply an optimal cut on the predictions, several figures of merit (FoM) are available; depending on the goal of the analysis, a different one may be chosen. As this study aims for a first observation, the Punzi FoM~\cite{Punzi:2003bu}
\begin{equation}
\mathrm{FoM}_{\mathrm{Punzi}} = \frac{S}{\sqrt{B} + \sigma/2},
\end{equation}
with $\sigma = 5$, is selected and maximised, where $S$ and $B$ are the signal and background yields after the cut. This yields the highest sensitivity for an observation with a significance of $5\,\sigma$ (a sketch of the scan is given after the figure).

\begin{figure}[bt]
\begin{subfigure}{0.5\textwidth}
\centering
\includegraphics[width=\linewidth]{figs/selection/predictions_of_xgb_classifier_small.eps}
\caption{}
\label{fig:selection:xgbpredictions}
\end{subfigure}%
%
\begin{subfigure}{0.5\textwidth}
\centering
\includegraphics[width=\linewidth]{figs/punzi_fom_vs_cut/punzi_fom_vs_cut-8-zoomed_max.pdf}
\caption{}
\label{fig:selection:fomvscuts}
\end{subfigure}
\caption{The output of the XGB classifier for the performance evaluation is shown in \ref{fig:selection:xgbpredictions}, resulting in a ROC AUC of 0.986; background (bck) refers to the right sideband. The optimisation of the cut is shown in \ref{fig:selection:fomvscuts}, where the Punzi FoM is plotted against the cut applied to the predictions. The optimal cut is at $94\%$.}
\end{figure}
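The cut optimisation sketched below scans the Punzi FoM over possible cut values on the classifier output. The counts of selected signal-MC and sideband candidates are used as stand-ins for $S$ and $B$; any normalisation of the sideband to the expected background in the signal region is omitted here and would have to be added for the actual optimisation.

\begin{verbatim}
# Minimal sketch of the Punzi FoM scan over the classifier output.
# pred_sig / pred_bkg are the out-of-fold predictions for signal MC and
# sideband background; counts are used as stand-ins for S and B.
import numpy as np

def punzi_scan(pred_sig, pred_bkg, sigma=5.0, n_points=200):
    best_cut, best_fom = 0.0, -np.inf
    for cut in np.linspace(0.0, 1.0, n_points):
        S = np.sum(pred_sig > cut)   # signal candidates passing the cut
        B = np.sum(pred_bkg > cut)   # background candidates passing the cut
        fom = S / (np.sqrt(B) + sigma / 2.0)
        if fom > best_fom:
            best_cut, best_fom = cut, fom
    return best_cut, best_fom
\end{verbatim}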
\subsection{Efficiencies}
\label{sec:selection:efficiency}

Not all particles produced in a collision are captured by the detector, as the geometrical detector acceptance is limited. To determine how many events are lost outside this acceptance, the geometrical efficiency is calculated using generated events. For the \Btokpipiee sample used in this thesis the efficiency is $0.148$ with an uncertainty of $\order(10^{-4})$.

\begin{table}[tb]
\caption[efficiencies]{Efficiencies of the different cuts. For every cut, the cuts listed above it are applied as well. The relative efficiency is quoted with respect to the previous cut.}
\begin{center}\begin{tabular}{c|l|c}
Events after cut & Cut added & Relative efficiency \\
\hline
$2,065,330$ & Geometrical & $14.8\%$ \\
$66,185$ & Stripping & $3.20\%$ \\
$25,622$ & \hlt & $38.7\%$ \\
$14,192$ & \qsq region & $55.4\%$ \\
$11,254$ & \nameref{sec:preselection} & $79.3\%$ \\
$7,789$ & \nameref{sec:selection:mva} & $69.2\%$ \\
\end{tabular}\end{center}
\label{tab:efficiency}
\end{table}

An overview of the cuts applied so far, together with their respective efficiencies, is given in Table~\ref{tab:efficiency}. Applying all cuts yields a total efficiency of $\etot = 0.0557\%$, corresponding to the product of the relative efficiencies listed in the table.
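As a cross-check, multiplying the rounded relative efficiencies quoted in Table~\ref{tab:efficiency} reproduces the total efficiency:
\begin{equation}
\etot = 0.148 \times 0.0320 \times 0.387 \times 0.554 \times 0.793 \times 0.692 \approx 5.57 \times 10^{-4} = 0.0557\%.
\end{equation}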